Recently there has been an increasing interest in the problem of estimating
a high-dimensional matrix K that can be decomposed into a sum of
a sparse matrix S* (i.e., a matrix having only a small number of nonzero
entries) and a low-rank matrix L*. This is motivated by applications in
computer vision, video segmentation, computational biology, semantic indexing,
etc. The main contribution and novelty of the Chandrasekaran, Parrilo and 
Willsky paper (CPW in what follows) is to propose and study a method of 
inference about such decomposable matrices for a particular setting where 
K is the precision (concentration) matrix of a partially observed sparse 
Gaussian graphical model (GGM). In this case, K is the inverse of the
covariance matrix of a Gaussian vector X_O extracted from a larger Gaussian
vector (X_O, X_H) with sparse inverse covariance matrix. Then it is easy to
see that K can be represented as a sum of a sparse precision matrix S*
corresponding to the observed variables X_O and a matrix L* with rank at
most h, where h is the dimension of the latent variables X_H. If h is small,
which is a typical situation in practice, then L* has low rank. The GGM
with latent variables is of major interest for applications in biology or in 
social networks where one often does not observe all the variables relevant 
for depicting sparsely the conditional dependencies. Note that formally this 
is just one possible motivation and mathematically the problem is dealt with 
in more generality, namely, postulating that the precision matrix satisfies 

(1) K = S* + L* 

with sparse S* and low-rank L*, both symmetric matrices. A small amendment
to this setting, inherited from the latent variables motivation, is that L* is
assumed negative definite (in our notation, L* corresponds to -L* in the
paper). We believe that this is not crucial and all the results remain valid
without this assumption. 
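
The decomposition (1) can be checked numerically through the Schur complement formula. The sketch below is only an illustration under assumed values: the dimensions, the tridiagonal observed block and the random coupling block are hypothetical and not taken from the paper. It builds a sparse joint precision matrix for (X_O, X_H), inverts the marginal covariance of X_O, and compares the result with a sparse block plus a matrix of rank at most h.

```python
import numpy as np

def marginal_precision_decomposition(K_joint, p):
    """Split the precision matrix of the first p coordinates into two parts.

    K_joint is the (p+h) x (p+h) precision matrix of (X_O, X_H).
    Returns (K, S_star, L_star), where K is the precision matrix of X_O.
    """
    K_OO = K_joint[:p, :p]
    K_OH = K_joint[:p, p:]
    K_HH = K_joint[p:, p:]
    # Precision of X_O alone: inverse of the O-block of the joint covariance.
    Sigma = np.linalg.inv(K_joint)
    K = np.linalg.inv(Sigma[:p, :p])
    S_star = K_OO                                   # sparse part
    L_star = -K_OH @ np.linalg.solve(K_HH, K_OH.T)  # rank <= h, negative semidefinite
    return K, S_star, L_star

# Hypothetical example: p = 8 observed and h = 2 latent variables.
rng = np.random.default_rng(0)
p, h = 8, 2
K_OO = 4.0 * np.eye(p) + np.diag(np.ones(p - 1), 1) + np.diag(np.ones(p - 1), -1)
K_OH = 0.2 * rng.standard_normal((p, h))
K_HH = 2.0 * np.eye(h)
K_joint = np.block([[K_OO, K_OH], [K_OH.T, K_HH]])

K, S_star, L_star = marginal_precision_decomposition(K_joint, p)
print(np.allclose(K, S_star + L_star), np.linalg.matrix_rank(L_star))
```

In this construction L* is negative semidefinite, which matches the sign convention discussed above.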
CPW propose to estimate the pair (S*, L*) from an n-sample of X_O by the
pair $(\hat S, \hat L)$ obtained by minimizing the negative log-likelihood with mixed
$\ell_1$ and nuclear norm penalties; cf. (1.2) of the paper. The key issue in this
context is identifiability. Under what conditions can we identify S* and
L* separately? CPW provide geometric conditions of identifiability based
on transversality of tangent spaces to the varieties of sparse and low-rank
matrices. They show that, under these conditions, with probability close to 1,
it is possible to recover the support of S*, the rank of L* and to get a bound
of order $O(\sqrt{p/n})$ on the estimation errors $|\hat S - S^*|_{\ell_\infty}$ and
$\|\hat L - L^*\|_2$. Here, p is the dimension of X_O and $|\cdot|_{\ell_q}$ and
$\|\cdot\|_2$ stand for the componentwise $\ell_q$-norm and the spectral norm of
a matrix, respectively.

Overall, CPW pioneer a hard and important problem of high-dimensional
statistics and provide an original solution, both in theory and as a
numerically implementable procedure. While being the first work to shed light on
the problem, the paper does not completely raise the curtain and several 
aspects still remain to be understood and elucidated. 

The nature of the results. The most important problem for current
applications appears to be the estimation of S* or the recovery of its support.
Indeed, the main interest is in the conditional dependencies of the
coordinates of X_O in the complete model (X_O, X_H) and this information is carried
by the matrix S*. In this context, L* is essentially a nuisance, so that bounds
on the estimation error of L* and the recovery of the rank of L* are of
relatively moderate interest. However, mathematically, the greatest sacrifice comes
from the desire to have precise estimates of L*. Indeed, if $\Sigma_n$ and $\Sigma$ denote
the empirical and population covariance matrices, the slow rate $O(\sqrt{p/n})$
comes from the bound on $\|\Sigma_n - \Sigma\|_2$ in Lemma 5.4, that is, from the
stochastic error corresponding to L*. Since the sup-norm error
$|\Sigma_n - \Sigma|_{\ell_\infty}$ is of order $\sqrt{(\log p)/n}$, can we get a better
rate when solely focusing on $|\hat S - S^*|_{\ell_\infty}$?
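
The gap between the two stochastic errors can be seen in a small simulation. The sketch below is illustrative only and rests on assumptions that are not in the paper: the population covariance is taken to be the identity, the mean is known to be zero, and the sample size and dimensions are arbitrary. It prints the spectral-norm and sup-norm errors of the sample covariance next to the reference quantities sqrt(p/n) and sqrt((log p)/n).

```python
import numpy as np

def covariance_errors(n, p, rng):
    """Spectral-norm and sup-norm errors of the sample covariance when Sigma = I."""
    X = rng.standard_normal((n, p))
    Sigma_n = X.T @ X / n          # mean assumed known and equal to zero
    E = Sigma_n - np.eye(p)
    return np.linalg.norm(E, 2), np.max(np.abs(E))

rng = np.random.default_rng(1)
n = 400
for p in (10, 50, 200):
    spec, sup = covariance_errors(n, p, rng)
    print(f"p={p:4d}  spectral={spec:.3f}  sqrt(p/n)={np.sqrt(p / n):.3f}  "
          f"sup={sup:.3f}  sqrt(log p / n)={np.sqrt(np.log(p) / n):.3f}")
```

The spectral norm of a matrix is never smaller than its largest entry in absolute value, so the first error always dominates the second; the simulation only shows how quickly the two separate as p grows.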

Extension to high dimensions. The results of the paper are valid and 
meaningful only when p < n. However, for the applications of GGM, the 
case $p \gg n$ is the most common. A key question is whether the restriction
p < n is intrinsic, that is, whether it is possible to have results on S* in
model (1) when p > n. Since the traditional model with sparse component
S* alone is still tractable when p > n, a related question is whether
introducing the model (1) with two components and estimating both S* and L*
gives any improvement in the p > n setting as compared to estimation in
the model with a sparse component alone. A small simulation study that 
we provide below suggests that already for p = n, including the low-rank 
component in the estimator may yield no improvement as compared to
traditional sparse estimation without the low-rank component, even though this
low-rank component is actually present in the model.
Optimal rates. The paper obtains bounds of order $O(\sqrt{p/n})$ on the
estimation errors $|\hat S - S^*|_{\ell_\infty}$ and $\|\hat L - L^*\|_2$ with probability
$1 - 2\exp(-p)$. Can we achieve a better rate than $\sqrt{p/n}$ when solely focusing
on the recovery of S* with the usual probability $1 - p^{-a}$ for some a > 0? Is the rate
$\sqrt{p/n}$ optimal in a minimax sense on some class of matrices? Note that one
should be careful in defining the class of matrices because in reality the rate
is not $O(\sqrt{p/n})$ but rather $O(\psi\sqrt{p/n})$, where $\psi$ is the spectral norm of $\Sigma$
depending on p. It can be large for large p. Surprisingly, not much is known
about the optimal rates even in the simpler case of purely sparse precision 
matrices, without the low-rank component. In this case, [1, 7] and [8] pro- 
vide some analysis of the upper bounds on the estimation error of different 
estimators and under different sets of assumptions on the precision matrix. 
All these bounds are of "order" $O(\sqrt{(\log p)/n})$, but again one should be
very careful here because of the factors depending on p that multiply this
rate. In [1], the factor is the squared $\ell_1 \to \ell_1$ norm of the precision matrix
while in [7], it is the squared degree of the graphical model multiplied by
some combinations of powers of matrix norms that are not easy to interpret.
The most recent paper [8] obtains the rate $O(d\sqrt{(\log p)/n})$, where d is the
degree of the graph for $\ell_\infty$-bounded precision matrices. An open problem is
to find optimal rates of convergence on classes of precision matrices defined 
via sparsity and low rank characteristics. The same problem makes sense 
for covariance matrices. Here, some advances have been achieved very re- 
cently. In particular, some optimal rates of estimation of low-rank covariance 
matrices are provided by [5]. 

The assumptions of the paper are stated in terms of some inaccessible
characteristics and seem to be very strong. They are
in the spirit of the irrepresentability condition for the vector case used to 
prove model selection consistency of the Lasso. For a given set of data, there 
is no means to check whether these assumptions are satisfied. What happens 
when they do not hold? Can we still have some convergence properties under 
no assumption at all or under weaker assumptions akin to the restricted 
eigenvalue condition in the vector case? 

Choice of the tuning parameters. The choice of parameters $(\gamma, \lambda_n)$
ensuring algebraic consistency in Theorem 4.1 depends on various unknown
quantities. Proposing a reasonable data-driven selector for $(\gamma, \lambda_n)$ (e.g., similarly
to [4] for the pure sparse setting) would be very helpful in practice.

Alternative methods of estimation. Constructively, the method of CPW 
is obtained from the GLasso of [2] by adding a penalization by the nuclear 
norm of the low-rank component. Similar low-rank extensions can be readily 
derived from other methods, such as the Dantzig type approach of [1] and the 
regression approach of [3, 6]. Consider a Gaussian random vector $X \in \mathbb{R}^p$
with mean 0 and nonsingular covariance matrix $\Sigma$. Let $K = \Sigma^{-1}$ be the
precision matrix. We assume that K is of the form (1) where S* is sparse 
and L* has low rank. 

(a) Dantzig type approach. In the spirit of [1], we may define our estimator 
as a solution of the following convex program: 

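(2) $(\hat S, \hat L) = \operatorname*{argmin}_{(S,L) \in \mathcal{G}} \{ |S|_{\ell_1} + \mu \|L\|_* \}$,

where $\|\cdot\|_*$ is the nuclear norm, $\mathcal{G} = \{(S,L) : |\Sigma_n (S + L) - I|_{\ell_\infty} \le \lambda\}$ and
$\mu, \lambda > 0$ are tuning constants. Here, the nuclear norm is a convex
relaxation of the rank of L*.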
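
To make the two ingredients of program (2) concrete, the following sketch evaluates the objective and checks the sup-norm constraint for a candidate pair (S, L). It is not a solver, and the function names, the simulated data and the tolerance are hypothetical choices made for illustration. The candidate used in the example, S equal to the inverse of the sample covariance and L = 0, is feasible whenever the sample covariance is invertible, although it is of course not sparse.

```python
import numpy as np

def dantzig_objective(S, L, mu):
    """|S|_{l1} + mu * ||L||_*: entrywise l1 norm plus mu times the nuclear norm."""
    return np.abs(S).sum() + mu * np.linalg.norm(L, "nuc")

def is_feasible(S, L, Sigma_n, lam):
    """Check the constraint |Sigma_n (S + L) - I|_{l_inf} <= lam."""
    p = Sigma_n.shape[0]
    return np.max(np.abs(Sigma_n @ (S + L) - np.eye(p))) <= lam

# Hypothetical data: n = 200 observations in dimension p = 5.
rng = np.random.default_rng(3)
n, p = 200, 5
X = rng.standard_normal((n, p))
Sigma_n = X.T @ X / n
S0 = np.linalg.inv(Sigma_n)   # trivially feasible, but dense, candidate
L0 = np.zeros((p, p))
print(is_feasible(S0, L0, Sigma_n, lam=1e-8), dantzig_objective(S0, L0, mu=1.0))
```

A convex solver would search over the feasible set for the pair with the smallest objective; the helpers above only make explicit what is being traded off.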

(b) Regression approach. The regression approach [3, 6] is an alternative 
point of view for estimating the structure of a GGM. In the pure sparse 
setting, some numerical experiments [9] suggest that it may be more
reliable than the $\ell_1$-penalized log-likelihood approach. Let diag(A) denote the
diagonal of a square matrix A and $\|A\|_F$ its Frobenius norm. Defining

$\Theta = \operatorname*{argmin}_{A :\, \mathrm{diag}(A) = 0} \|\Sigma^{1/2} (I - A)\|_F^2$,

we have $\Theta = K\Delta + I$, where I is the identity matrix and $\Delta$ is the diagonal
matrix with diagonal elements $\Delta_{jj} = -1/K_{jj}$ for j = 1, ..., p. Thus, we have
the decomposition

$\Theta = S + L$, where $S = S^*\Delta + I$ and $L = L^*\Delta$.

Note that rank(L) = rank(L*) and the off-diagonal elements $S_{ij}$ of the matrix
S are nonzero only if $S^*_{ij}$ is nonzero. Therefore, recovering the support of S*
and rank(L*) is equivalent to recovering the support of S and rank(L).
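
The closed form for the minimizer of the regression criterion can be checked numerically. The sketch below is a sanity check under assumed inputs (a randomly generated positive definite precision matrix, with hypothetical function names): it forms K Delta + I, confirms that its diagonal vanishes, and verifies that a perturbation keeping the zero diagonal does not decrease the criterion.

```python
import numpy as np

def theta_from_precision(K):
    """Return K @ Delta + I with Delta = diag(-1 / K_jj)."""
    Delta = np.diag(-1.0 / np.diag(K))
    return K @ Delta + np.eye(K.shape[0])

def regression_loss(A, Sigma):
    """||Sigma^{1/2} (I - A)||_F^2, computed as trace((I - A)^T Sigma (I - A))."""
    R = np.eye(Sigma.shape[0]) - A
    return np.trace(R.T @ Sigma @ R)

rng = np.random.default_rng(2)
p = 6
B = rng.standard_normal((p, p))
K = B @ B.T + p * np.eye(p)        # hypothetical positive definite precision matrix
Sigma = np.linalg.inv(K)
Theta = theta_from_precision(K)

# Perturb Theta while keeping a zero diagonal; the loss should not decrease.
P = 0.1 * rng.standard_normal((p, p))
np.fill_diagonal(P, 0.0)
print(regression_loss(Theta, Sigma) <= regression_loss(Theta + P, Sigma))
```

Column j of the minimizer collects the coefficients of the regression of the j-th coordinate on the others, which is why the problem separates column by column.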

Now, we estimate (S, L) from an n-sample of X represented as an n x p
matrix $\mathbf{X}$. Noticing that the sample analog of $\|\Sigma^{1/2}(I - A)\|_F^2$ is
$\|\mathbf{X}(I - A)\|_F^2 / n$ and using the decomposition $\Theta = S + L$, we arrive at the following
estimator:

(3) $(\hat S, \hat L) = \operatorname*{argmin}_{(S,L) :\, \mathrm{diag}(S+L) = 0} \{ \frac{1}{2} \|\mathbf{X}(I - S - L)\|_F^2 + \lambda |S|_{\ell_1,\mathrm{off}} + \mu \|\mathbf{X}L\|_* \}$,

where $\mu, \lambda$ are positive tuning constants and $|S|_{\ell_1,\mathrm{off}} = \sum_{i \ne j} |S_{ij}|$. Note that
here the low-rank shrinkage is driven by the nuclear norm $\|\mathbf{X}L\|_*$ rather
than by $\|L\|_*$. The convex minimization in (3) can be performed efficiently
by alternating block descents on the off-diagonal elements of S, the matrix
L and the diagonal of S. The off-diagonal support of S* is finally estimated
by the off-diagonal support of $\hat S$.
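
The text does not spell out the block updates. As a hedged illustration only, the sketch below shows the two standard proximal maps on which descent schemes for l1 and nuclear-norm penalties typically rely: entrywise soft-thresholding and singular value thresholding. The function names are hypothetical, this is not claimed to be the implementation used for (3), and the constraint diag(S + L) = 0 is ignored here.

```python
import numpy as np

def soft_threshold(M, t):
    """Prox of t * |.|_{l1}: shrink every entry of M toward zero by t."""
    return np.sign(M) * np.maximum(np.abs(M) - t, 0.0)

def singular_value_threshold(M, t):
    """Prox of t * ||.||_*: soft-threshold the singular values of M."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - t, 0.0)) @ Vt

M = np.array([[3.0, -1.0], [0.5, -4.0]])
print(soft_threshold(M, 1.0))
print(singular_value_threshold(np.diag([3.0, 1.0]), 2.0))
```

Soft-thresholding sets small entries exactly to zero, which produces sparsity, while singular value thresholding sets small singular values to zero, which produces low rank.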
Fig. 1. Each color corresponds to a fixed value of $\mu$, the solid-black color being for
$\mu = +\infty$. For each choice of $\mu$, different quantities are plotted for a series of values of $\lambda$.
Left: Mean rank of $\mathbf{X}\hat L$. Middle: The curve of estimated power versus estimated FDR.
Right: The power versus FDR for the estimators fulfilling $E[\mathrm{rank}(\mathbf{X}\hat L)] \approx h = 3$ (red dots),
superposed with the power versus the FDR for $\mu = +\infty$ (in solid-black).

Numerical experiment. A sparse Gaussian graphical model in $\mathbb{R}^{30}$ is
generated randomly according to the procedure described in Section 4 of [4].
A sample of size n = 30 is drawn from this distribution and $\mathbf{X}$ is obtained
by hiding the values of 3 variables. These 3 hidden variables are chosen
randomly among the connected variables. The estimators $(\hat S, \hat L)$ defined in (3)
are then computed for a grid of values of $\lambda$ and $\mu$. The results are
summarized in Figure 1 (average over 100 simulations).
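
The power and FDR reported in Figure 1 are not defined in the text. The sketch below assumes the usual definitions for edge recovery: power is the fraction of true edges that are detected, and FDR is the fraction of detected edges that are false. The function name and the toy edge lists are hypothetical; in a simulation study these quantities would be averaged over the repetitions.

```python
def power_and_fdr(true_edges, estimated_edges):
    """Return (power, FDR) for undirected edge sets given as pairs of node indices."""
    # Edges are undirected, so (i, j) and (j, i) are treated as the same edge.
    true_set = {frozenset(e) for e in true_edges}
    est_set = {frozenset(e) for e in estimated_edges}
    true_pos = len(true_set & est_set)
    power = true_pos / len(true_set) if true_set else 0.0
    fdr = (len(est_set) - true_pos) / len(est_set) if est_set else 0.0
    return power, fdr

truth = [(0, 1), (1, 2), (2, 3), (3, 4)]
estimate = [(1, 0), (2, 3), (0, 4)]
power, fdr = power_and_fdr(truth, estimate)
print(power, fdr)
```

Here two of the four true edges are recovered and one of the three detections is false.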

Strikingly, there is no significant difference in these examples between
the procedure of [6] (corresponding to $\mu = +\infty$, in solid-black) and the
procedure (3) that includes the low-rank component (corresponding to finite $\mu$).